최적 정규화

오차-분산 트레이드오프

모형 최적화 뒤의 편향 오차(bias)와 오차 분산(variance)는 다음과 같은 트레이드-오프(trade-off) 관계를 가진다. 즉, 어느 하나가 작아지면 다른 하나는 커지게 된다.

함수 $f$를 다른 함수 $\hat{f}$로 모사(approximation)할 때 오차를 $\epsilon$이라고 하면

$$ y = f + \epsilon \approx \hat{f} $$

이다. 이때

$$\text{E}[y] = \text{E}[f + \epsilon] = \text{E}[f] = f$$

편향 오차(bias)는 다음과 같이 정의한다. $$ \begin{align} \text{Bias}[\hat{f}] = f - \text{E}[\hat{f}] \end{align} $$

오차의 분산(Variance)은 다음과 같이 정의한다. $$ \begin{align} \text{Var}[\hat{f}] = \text{E}[ ( \hat{f} - \text{E}[\hat{f}])^2 ] \end{align} $$

이 때 편향 오차와 오차 분산은 다음과 같은 관계가 성립하므로 동시에 줄일 수 없다.

$$ \begin{align} \text{E}[(y - \hat{f})^2] & = (\text{Bias}[\hat{f}(x)])^2 + \text{Var}[\hat{f}(x)] + \text{Var}[\epsilon] \\ \end{align} $$

(증명)

$$ \begin{align} \text{Var}[y] &= \text{E}[(y - \text{E}[y])^2] \\ &= \text{E}[(y - f)^2] \\ &= \text{E}[(f + \epsilon - f)^2] \\ &= \text{E}[\epsilon^2] \\ &= \text{Var}[\epsilon] \end{align} $$$$ \begin{align} \text{E}\big[(y - \hat{f})^2\big] & = \text{E}[y^2] + \text{E}[\hat{f}^2] - \text{E}[2y\hat{f}] \\ & = \text{Var}[y] + \text{E}[y]^2 + \text{Var}[\hat{f}] + \text{E}[\hat{f}]^2 - 2\text{E}[y\hat{f}] \\ & = \text{Var}[y] + f^2 + \text{Var}[\hat{f}] + \text{E}[\hat{f}]^2 - 2f\text{E}[\hat{f}] \\ & = \text{Var}[y] + \text{Var}[\hat{f}] + f^2 - 2f\text{E}[\hat{f}] + \text{E}[\hat{f}]^2 \\ & = \text{Var}[y] + \text{Var}[\hat{f}] + (f - \text{E}[\hat{f}])^2 \\ & = \text{Var}[y] + \text{Var}[\hat{f}] + (\text{Bias}[\hat{f}])^2 \\ \end{align} $$

다항회귀에서 차수를 바꾸어가면서 오차와 분산을 측정하면 다음과 같다.



In [1]:

    
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.cross_validation import cross_val_score

n_samples = 1000
np.random.seed(0)
X = np.sort(np.random.rand(n_samples))
y = np.cos(1.5 * np.pi * X) + np.random.randn(n_samples) * 0.1
X = X[:, np.newaxis]
K = 100

def cv_mse(degree):
    polynomial_features = PolynomialFeatures(degree=degree)
    linear_regression = LinearRegression()
    model = Pipeline([("polynomial_features", polynomial_features),
                      ("linear_regression", linear_regression)])
    scores = -cross_val_score(model, X, y, "mean_squared_error", cv=K)
    return scores



In [2]:

    
scores1 = cv_mse(3)
sns.distplot(scores1)
print(scores1.mean(), scores1.std())









    



0.0130824469849 0.00760649444156



In [3]:

    
D = 9
degrees = 2**np.arange(D)
all_scores = np.zeros((K, D))
for i, d in enumerate(degrees):
    scores = cv_mse(d)
    all_scores[:, i] = scores
df = pd.DataFrame(-np.log(all_scores), columns=degrees)
df.describe()



In [4]:

    
df.mean().plot(kind="bar", rot=0, yerr=df.std())
plt.show()

하이퍼 모수 최적화

하이퍼 모수가 바뀌는 경우에도 마찬가지로 오차-분산 트레이드오프가 성립하므로 최적의 하이퍼 모수를 찾는 작업이 필요하다.



In [11]:

    
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.cross_validation import cross_val_score

data = load_diabetes()
X = data.data[:200]
y = data.target[:200]

model = Lasso()
alphas = np.logspace(-4, -.5, 50)

scores = list()

for alpha in alphas:
    model.alpha = alpha
    this_scores = cross_val_score(model, X, y, "mean_squared_error", cv=5)
    scores.append(np.mean(this_scores))

plt.semilogx(alphas, scores)
plt.ylabel('CV score')
plt.xlabel('alpha')
plt.axhline(np.max(scores), linestyle=':')
plt.show()



In [12]:

    
from sklearn.linear_model import LassoCV

alphas = np.logspace(-4, -.5, 50)
lasso_cv = LassoCV(alphas=alphas, cv=5)
lasso_cv.fit(X, y)

print(lasso_cv.alpha_ )

scores = -lasso_cv.mse_path_.mean(axis=1)
plt.semilogx(lasso_cv.alphas_, scores)
plt.axhline(np.max(scores), linestyle=':')    
plt.axvline(lasso_cv.alpha_ , linestyle=':')   
plt.ylabel('CV score')
plt.xlabel('alpha')
plt.show()









    



0.0439397056076

One Standard Error Rule

실제로 하이퍼 모수 값을 변경해가면서 최적 값은 오차가 존재하는 부정확한 값이다. 이러한 오차를 고려해서 표준 편차(1 standard deviation) 정도의 오차는 감수하더라도 더 단순한(정규화 모형에서는 더 제약 조건이 강한) 모형을 선택하는 것이 실용적이다.



In [13]:

    
from sklearn.linear_model import LassoCV

alphas = np.logspace(-4, -.5, 50)
lasso_cv = LassoCV(alphas=alphas, cv=5)
lasso_cv.fit(X, y)

scores = -lasso_cv.mse_path_.mean(axis=1)
scores_std = lasso_cv.mse_path_.std(axis=1)
scores_std1 = scores + scores_std / np.sqrt(len(lasso_cv.mse_path_))
scores_std2 = scores - scores_std / np.sqrt(len(lasso_cv.mse_path_))

alpha_1se = lasso_cv.alphas_[np.argmax(scores_std1 > np.max(scores))]
print(alpha_1se)

plt.semilogx(lasso_cv.alphas_, scores)
plt.semilogx(lasso_cv.alphas_, scores_std1, 'o-')
plt.semilogx(lasso_cv.alphas_, scores_std2, 'o-')
plt.axhline(np.max(scores), linestyle=':')    
plt.axvline(lasso_cv.alpha_ , linestyle=':')   
plt.axvline(alpha_1se, linestyle=':')   
plt.ylabel('CV score')
plt.xlabel('alpha')
plt.show()









    



0.117876863479

	1	2	4	8	16	32	64	128	256
count	100.000000	100.000000	100.000000	100.000000	100.000000	100.000000	100.000000	100.000000	100.000000
mean	2.152711	3.431163	4.735579	4.757005	4.746503	4.659619	4.560883	4.521579	4.436088
std	1.197341	0.801431	0.470647	0.464274	0.481904	0.745300	1.280429	1.811851	2.616881
min	-0.143751	1.023711	3.757597	3.746899	3.703366	-0.956651	-6.694831	-12.489364	-20.971026
25%	1.283354	2.877161	4.449526	4.447033	4.426598	4.389520	4.322019	4.319936	4.345054
50%	1.912740	3.341536	4.659961	4.698884	4.686885	4.667370	4.689061	4.676808	4.664626
75%	2.922163	3.963974	5.023880	5.012029	5.031434	5.040994	5.025874	5.035373	5.013957
max	5.195028	5.061568	6.386555	6.284238	6.139301	6.346065	6.015190	6.540089	6.387756